GH-50162: [C++][Parquet] Avoid int32 overflow in BitPackedRunDecoder::GetBatch offset#50089
GH-50162: [C++][Parquet] Avoid int32 overflow in BitPackedRunDecoder::GetBatch offset#50089metsw24-max wants to merge 2 commits into
Conversation
|
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? or See also: |
|
@metsw24-max Thanks for submitting this PR! Can you open a corresponding GitHub issue as instructed by the comment bot above? @AntoinePrv Would you like to review these changes? |
AntoinePrv
left a comment
There was a problem hiding this comment.
I think ideally we would want to check if the untrusted int is within the bounds of the parquet specs, and if not error (here it may be return negative value?).
Though is the bit_width check done before? if not, should it?
| /* .bit_offset= */ static_cast<int>(bits_read % 8), | ||
| /* .max_read_bytes= */ static_cast<int>(max_read_bytes_ - bytes_fully_read), |
There was a problem hiding this comment.
In the same way that it may previously not fit in an int do we know that it will fit it at this point (and not turn negative)?
There was a problem hiding this comment.
For parser-produced runs it does hold: PeekImpl only emits a run whose whole payload fits in the remaining buffer (it truncates or rejects the run otherwise), and the BitPackedRun constructor DCHECKs that invariant. Since values_read_ <= values_count_, bytes_fully_read <= max_read_bytes_, so the difference stays in [0, max_read_bytes_], which is itself an rle_size_t. The remaining case is the negative sentinel (no bound), where the difference stays negative and unpack treats any negative value the same as -1. Added a DCHECK at the subtraction site to make that explicit.
|
|
|
@pitrou done, opened GH-50162 and renamed the title to match. @AntoinePrv on the spec bounds: the bit width is already checked before it reaches this decoder. Dictionary indices reject anything above 32 in |
int32 overflow in the bit-packed run decoder offset
GetBatch works out the byte position with
values_read_ * value_bit_widthin 32-bit int. For a large bit-packed run (this decodes untrusted parquet RLE/bit-packed dictionary indices and levels, with value width up to 64) the product passes INT32_MAX and wraps negative, so bytes_fully_read goes negative and unread_data ends up before the buffer, giving an out of bounds read in unpack. raw_data_size just above already widens to int64 before the same multiply, so I matched that here.